The five point scale is the most frequently applied scaling used in the
current practices for evaluating instructor classroom performance through
graduate student observations. Hence, the investigation addressed itself
toward determining, through a series of 55 computerized exact randomization
tests, what degree of mean difference among graduate student reported
classroom means would produce statistical significance at alpha .05 on a
one-tailed test in either direction. Obviously, the primary intent was to
sort out nonrandom from random reported observations, so that when an
instructor compared two of his own means from two classes, or when the
reported means of two instructors were compared, such evidence would be
nonrandom rather than random, as usually required in behavioral theory and
analysis. The results indicated that an instructor would have to have a mean
difference between 2.25 and 2.50 to assure himself reasonably that the
reported observations on him were nonrandom. The simulated results revealed
several severe constraints with the five point scale, making its practical
application and interpretation most questionable. Two references are
listed. (Author/JE)

ABSTRACT

The five point scale is the most frequently applied scaling used in
the current practices for evaluating instructor classroom performance
through graduate student observations. Hence, this investigation addressed
itself toward determining, through a series of computerized exact
randomization tests, what degree of mean difference among graduate student
reported classroom means would produce statistical significance at alpha .05
on a one-tailed test in either direction. Obviously, the primary intent was
to sort out nonrandom from random reported observations, so that when an
instructor compared two of his own means from two classes, or when the
reported means of two instructors were compared, such evidence would be
nonrandom rather than random, as usually required in behavioral theory and
analysis.

The simulated results revealed several severe constraints with the
five point scale, making its practical application and interpretation most
questionable.

Graduate student evaluations of instructor classroom performance seem
now to be routine procedures under the current quantification notions of
behavioral accountability, including instructor-stated "instructional
objectives," with this teaching to be assessed through graduate student
observations on some "rating scale." Particularly, student evaluation
forms usually contain the rather flabby, non-operationalized item, "rate
the overall teaching ability of this instructor" or "considering
everything, how do you rate the teaching ability of this instructor" on a
five point scale. From these questionable observations, means, standard
deviations and other statistics are computed, and then surreptitious
administrative comparisons between an instructor's own courses, as well as
between two instructors' courses, are accomplished.



To say nothing of these "apples and oranges" comparisons, the proverbial
0-5 point scale itself was subjected in this simulation investigation to
statistical mean difference comparisons through the computer-programmed
Lohnes and Cooley (1968) exact randomization test. As the authors claimed,
this program computed a replacement random sample of 200 points from the
possible t-test outcomes of assigning n scores in two groups (samples)
in all combinations of n things assigned n/2 at a time.
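
The sampling idea described above can be sketched as follows. This is only
an illustrative sketch, not the Lohnes and Cooley program: the function
names, the fixed seed, and the eight scores are assumptions made for the
example, and the scores are hypothetical rather than data from the study.

```python
import random
from statistics import mean, variance

def t_stat(a, b):
    """Independent-samples t for two equal-sized groups."""
    # Standard error of the mean difference when both groups have the same n.
    se_diff = ((variance(a) + variance(b)) / len(a)) ** 0.5
    return (mean(a) - mean(b)) / se_diff

def sampled_randomization_ts(scores, draws=200, seed=0):
    """Return the sorted t values from `draws` random splits of `scores`.

    Splits are drawn independently, so the same split may recur
    (sampling with replacement from the possible assignments).
    """
    rng = random.Random(seed)
    half = len(scores) // 2
    ts = []
    for _ in range(draws):
        shuffled = rng.sample(scores, len(scores))  # one random reassignment
        ts.append(t_stat(shuffled[:half], shuffled[half:]))
    return sorted(ts)

# Hypothetical class means on the five point scale (not data from the study).
scores = [4.5, 4.25, 4.0, 4.75, 2.0, 2.25, 1.75, 2.5]
outcomes = sampled_randomization_ts(scores)
print(len(outcomes), outcomes[0], outcomes[-1])
```

The obtained t from the actual grouping is then located within the sorted
outcomes to judge how unusual it is under random assignment alone.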



Randomization tests, according to Siegel (1956), were the most
powerful non-parametric techniques whenever measurements were so precise
as to give the scores numerical meanings. Did graduate student class
means on an instructor's classroom performance have such numerical meaning?
And these means, foremost, had to be non-random in their outcomes according to
behavioral theory to obtain such numerical meaning. Therefore, in this
investigation, the level of statistical significance was set at the proverbial
alpha .05. Because the exact randomization test used all the information in
its non-randomly selected samples, for two independent samples, the exact
randomization test had a power efficiency of 100 per cent (Siegel, 1956).
Now, how much difference would Professor Everyman have to realize between two
of his class means at alpha .05 to convince himself that the two obtained
means on him were statistically significant, even if one course he taught was
in educational statistics and the other in educational history, thus leading
to a possible "apples and oranges" comparison nevertheless?

For the computer runs, whose results are reported in the table, an n of
eight was selected on the premise that an instructor had a normal teaching
load of twelve contact hours with four classes. He thus had four classes one
semester and four the next. Each score fed into the computer with the n of
eight was greater than zero and less than five. Obviously, an infinite number
of scores between zero and five were possible, but the data in the table do,
it is believed, establish reasonable limits for the comparison by a given
instructor of his overall performance for one semester against another and, at
the same time, for a comparison between two instructors despite the
"apples and oranges" limitations. After all, the principal intent in this
investigation was to find mean difference limits. Thus, for a second insight,
what mean difference would be required to assert that Professor Excellent's
overall mean was statistically significant at alpha .05 from Professor Poor's
mean, despite the fact that one might be in the physics department, while the
other is in engineering?

The range of means, as indicated in the table, was from .25 to 4.75 on
the five point scale. The total possible number of outcomes for an n of eight
(four course means on either side) resulted in:

n! / [(n/2)!]^2 or 8! / (4!)^2 or 40,320/576 = 70

Therefore seventy computer runs were possible. Fifty-five were actually
completed, for the mean differences of 2.25 to 2.50 established the zones
between statistical significance and nonsignificance at alpha .05 on a
one-tailed test in either direction. A few runs shown in the table represent
duplication for confirmation, as well as a few runs with the interchange of
M1 - M2 for M2 - M1, to check on direction in the randomization.
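
The count of 70 can be checked by enumerating every way of assigning eight
scores four at a time. The sketch below is an illustration under stated
assumptions: it uses the raw mean difference as the test statistic (the
Lohnes and Cooley program worked with t outcomes), the function name is
hypothetical, and the scores are invented for the example rather than taken
from the table.

```python
from itertools import combinations
from math import comb

def exact_one_tailed_p(group_a, group_b):
    """Exact randomization p for the observed mean difference (A minus B).

    Returns (p, number_of_splits). Every split of the pooled scores into
    groups of the original sizes is enumerated once.
    """
    scores = group_a + group_b
    k = len(group_a)
    total = sum(scores)
    observed = sum(group_a) / k - sum(group_b) / len(group_b)
    as_extreme = 0
    n_splits = 0
    for idx in combinations(range(len(scores)), k):
        sum_a = sum(scores[i] for i in idx)
        diff = sum_a / k - (total - sum_a) / (len(scores) - k)
        n_splits += 1
        if diff >= observed:  # one-tailed: at least as large as observed
            as_extreme += 1
    return as_extreme / n_splits, n_splits

# Hypothetical class means (not data from the study); mean difference is 2.25.
semester_one = [4.5, 4.25, 4.0, 4.75]
semester_two = [2.0, 2.25, 1.75, 2.5]
p_value, n_splits = exact_one_tailed_p(semester_one, semester_two)
print(comb(8, 4), n_splits, p_value)
```

With only 70 possible splits, the smallest attainable one-tailed probability
is 1/70, reached here because the two hypothetical groups do not overlap at
all. Groups with the same mean difference but overlapping scores yield
larger probabilities.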


The Lohnes and Cooley program produced 200 t-distributions on each run.
According to the authors, "A nice thing about 200 outcomes is that .01 times
the order number (or rank) of the randomization outcome equal to or closest
to (on the small side) the absolute value of the obtained t is the two-tailed
probability of the actual outcome of the experiment on the null hypothesis
that randomization alone explains the group difference."

Mean differences of 2.0, more often than not, produced non-significant
probabilities at alpha .05 on a one-tailed test. Obviously, the size of the
standard error of the mean difference in the formula, mean difference /
standard error of the mean difference, was somewhat controlling. This cited
formula is more often known as the expression

t = (M1 - M2) / sqrt(s.e.M1^2 + s.e.M2^2)
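
To illustrate how the standard error controls the outcome, the sketch below
computes t for two hypothetical pairs of groups that share the same mean
difference of 2.0 but differ in spread. It assumes the usual
independent-samples form, mean difference over the square root of the summed
squared standard errors; the function name and all scores are invented for
the example.

```python
from statistics import mean, stdev

def t_from_groups(a, b):
    """t = (M1 - M2) / sqrt(se1^2 + se2^2) for two independent groups."""
    se_a = stdev(a) / len(a) ** 0.5  # standard error of the first mean
    se_b = stdev(b) / len(b) ** 0.5  # standard error of the second mean
    return (mean(a) - mean(b)) / (se_a ** 2 + se_b ** 2) ** 0.5

# Both pairs have means of 4.0 and 2.0, so the mean difference is 2.0.
tight = t_from_groups([4.0, 4.25, 3.75, 4.0], [2.0, 2.25, 1.75, 2.0])
wide = t_from_groups([4.75, 3.0, 4.5, 3.75], [0.5, 3.5, 1.5, 2.5])
print(round(tight, 2), round(wide, 2))
```

The tightly clustered pair gives a much larger t than the widely scattered
pair, which is one way the same mean difference can fall on either side of
the significance line.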



Data on Fifty-five Exact Randomization Tests
On the other hand, mean differences of 2.5 on the five point scale produced
statistically significant results at alpha .05, with the obtained probabilities
being .01 or .02 in either direction on a one-tailed test. As shown in the
table, mean differences greater than 2.5 produced statistical significance,
while mean differences less than 2.0 did not.


Mean differences of 2.25 seemed to be in the penumbral area, producing
probabilities from .02 to .08 on a one-tailed test in either direction. Thus
the zones in which Type I and Type II errors were, in general, being produced
were also somewhat identified.

What would all the above in part indicate? An instructor would have to
realize a mean difference between 2.25 and 2.50 or greater to assure himself
reasonably that the reported observations on him were non-random.

Thus, if one class mean were 4.5, the lower mean would have to be 2.25,
that is, 4.5 - 2.25 = 2.25. Or with a mean difference of 2.5, if the higher
mean were 4.5, the lower mean would have to be 2.0, that is, 4.5 - 2.5 = 2.0.
The same could be asserted for the more questionable comparison between
Professor Excellent and Professor Poor, where, it is held, the "apples and
oranges" comparison would be further magnified because of situational
differences, including course content, class size, disciplines, and so on.



At my institution, the reported graduate student data I have seen over
a four year period have indicated that graduate students are reluctant to use
the higher end as well as the lower end of the five point scale; therefore,
not many 4.5 to 5.0 nor 0.0 to 2.5 means are produced. As a matter of fact,
I have never seen a mean of 3.0 or less. Since under behavioral theory
statistical significance must be insisted upon in order to separate random
from non-random propositions and/or outcomes, the accountability advocates
might take another look at scaling and the behavioral numbers game in their
efforts to quantify teacher performance through student observational data.